In baseball, pitchers manipulate their grip on the ball to affect the path, speed, and rotation of their pitches. In Major League Baseball, there are a number of different pitches, each with its own defining characteristics. There are fastballs, which, as their name implies, are fast pitches. There are numerous variations of the fastball, such as the four-seam fastball and the sinker, which vary in their velocity and movement. There are also breaking balls, which are slower than fastballs and move laterally or downward; some examples of breaking balls are curveballs and sliders. In summary, pitchers in baseball use a variety of pitch grips to vary their pitches in order to deceive batters. For this project, I wanted to train a machine learning model on pitch data such that the model can identify the grip that the pitcher used to throw the ball.

For this project, I am using data from MLB’s Statcast, which is a service that utilizes radars to track players. To get the data, I used the baseballr package, which you can find at https://billpetti.github.io/baseballr/. This package allows users to scrape baseball data from numerous sources, including Statcast data from https://baseballsavant.mlb.com.

I used the following packages for this project:

library(baseballr)
library(dplyr)
library(ggplot2)
library(caret)
library(tidyverse)
library(plotly)

The first thing that I did in this project was get the data into my environment. I used the statcast_search function included in the baseballr package. This function downloads Statcast data for every pitch between the specified start and end dates. I chose 09/26/2023 as the start date and 10/01/2023 as the end date, and stored the results in the pitches variable. After the pitch data was downloaded, I took a subset of the pitches dataset so I was only looking at variables that will help me classify the kind of pitch grip used. I stored this subset in pitches_sub. The variables that I decided to include were:

* pitch_type (abbreviation indicating the pitch grip used)
* release_speed (in miles/hour)
* p_throws (character indicating the pitcher’s handedness)
* pfx_x, pfx_z (horizontal and vertical movement of the pitch in feet)
* vx0, vy0, vz0 (pitch velocity in feet per second in the x, y, and z dimensions)
* ax, ay, az (pitch acceleration in ft/s^2 in the x, y, and z dimensions)
* spin_axis (spin axis in the 2D plane, from 0 to 360 degrees)

pitches = statcast_search(start_date = "2023-09-26", end_date = "2023-10-01")

pitches_sub = subset(pitches, select = c(pitch_type, release_speed, p_throws, pfx_x, pfx_z, vx0, vy0, vz0, ax, ay, az, spin_axis))

Here is the head and dimension of pitches_sub:

head(pitches_sub)
dim(pitches_sub)
## [1] 25000    12

One problem is that left-handed and right-handed pitchers get mirror-image movement on their pitches. For example, a slider thrown by a left-handed pitcher moves to the left (from the catcher’s perspective), while a slider thrown by a righty will move to the right. Here is a plot visualizing this:

sliders = pitches_sub %>%
  filter(pitch_type == "SL") %>%
  group_by(p_throws)
ggplot(sliders, aes(x = pfx_x, y = pfx_z, color = p_throws)) +
  geom_point()

As you can see in the plot, the horizontal movement of sliders is essentially opposite for left-handed and right-handed pitchers. We would expect a similar difference for every variable in our dataset that represents horizontal movement of the ball. To address this, I multiplied pfx_x, vx0, and ax by -1 for pitches thrown by left-handed pitchers. For spin_axis, I set it equal to 360 - spin_axis for left-handed pitchers. This way, pitch data is standardized across left-handed and right-handed pitchers.

#Standardizing data for right handed and left handed pitchers
pitches_sub = pitches_sub %>%
  mutate(
    spin_axis = case_when(p_throws == "L" ~ 360 - spin_axis,
                          p_throws == "R" ~ spin_axis),
    ax        = case_when(p_throws == "L" ~ -ax,
                          p_throws == "R" ~ ax),
    vx0       = case_when(p_throws == "L" ~ -vx0,
                          p_throws == "R" ~ vx0),
    pfx_x     = case_when(p_throws == "L" ~ -pfx_x,
                          p_throws == "R" ~ pfx_x)
  )

Here is the plot of slider movement now that the data has been standardized:

sliders = pitches_sub %>%
  filter(pitch_type == "SL") %>%
  group_by(p_throws)
ggplot(sliders, aes(x = pfx_x, y = pfx_z, color = p_throws)) +
  geom_point()

Now, the movement in the dataset is essentially the same for sliders thrown by left handed and right handed pitchers.

Now that we have standardized the data for pitches thrown by left-handed and right-handed pitchers, we can remove the p_throws variable from the pitches_sub dataset, as it will not be needed for classifying pitch types. We can now look at a summary of the data and check for NA values.

#Remove p_throws column
pitches_sub = subset(pitches_sub, select = -c(p_throws))
summary(pitches_sub)
##   pitch_type        release_speed        pfx_x             pfx_z        
##  Length:25000       Min.   : 39.90   Min.   :-1.9600   Min.   :-2.0200  
##  Class :character   1st Qu.: 84.70   1st Qu.:-1.0900   1st Qu.: 0.1400  
##  Mode  :character   Median : 89.90   Median :-0.5600   Median : 0.6400  
##                     Mean   : 89.08   Mean   :-0.3727   Mean   : 0.5894  
##                     3rd Qu.: 94.00   3rd Qu.: 0.2800   3rd Qu.: 1.1800  
##                     Max.   :102.70   Max.   : 2.0300   Max.   : 2.0600  
##                     NA's   :1        NA's   :1         NA's   :1        
##       vx0              vy0               vz0                ax         
##  Min.   :-6.999   Min.   :-149.15   Min.   :-14.947   Min.   :-27.302  
##  1st Qu.: 3.725   1st Qu.:-136.63   1st Qu.: -5.743   1st Qu.:-14.026  
##  Median : 5.649   Median :-130.78   Median : -3.860   Median : -8.088  
##  Mean   : 5.665   Mean   :-129.56   Mean   : -3.768   Mean   : -6.104  
##  3rd Qu.: 7.588   3rd Qu.:-123.21   3rd Qu.: -1.857   3rd Qu.:  2.084  
##  Max.   :19.811   Max.   : -57.48   Max.   : 12.061   Max.   : 19.651  
##  NA's   :1        NA's   :1         NA's   :1         NA's   :1        
##        ay               az            spin_axis    
##  Min.   : 5.272   Min.   :-50.770   Min.   :  0.0  
##  1st Qu.:23.956   1st Qu.:-30.310   1st Qu.:136.0  
##  Median :26.859   Median :-24.393   Median :211.0  
##  Mean   :26.970   Mean   :-24.149   Mean   :179.6  
##  3rd Qu.:29.974   3rd Qu.:-16.671   3rd Qu.:227.0  
##  Max.   :41.311   Max.   : -5.012   Max.   :360.0  
##  NA's   :1        NA's   :1         NA's   :2
#Getting all rows with NA values and printing them
narow = pitches_sub[!complete.cases(pitches_sub), ]
print(narow)
## ── MLB Baseball Savant Statcast Search data from baseballsavant.mlb.com ────────
## ℹ Data updated: 2023-10-31 22:09:26 MST
## # A tibble: 2 × 11
##   pitch_type release_speed pfx_x pfx_z   vx0   vy0   vz0    ax    ay    az
##   <chr>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "FF"                94.4 -0.71  1.41  7.01 -137. -4.32 -10.6  31.7 -13.6
## 2 ""                  NA   NA    NA    NA      NA  NA     NA    NA    NA  
## # ℹ 1 more variable: spin_axis <dbl>

Here, we see that one row has NA values for all of the variables. This row will not be able to contribute to training a model, so we will remove this row from pitches_sub. The other row with an NA value only has an NA value for the spin axis. While this value may be NA, some of the other values could be important for pitch classification. Therefore, we will set the spin_axis for this row equal to the median spin_axis.

#Delete row with release_speed == NA (removes row with all NA values)
pitches_sub = pitches_sub[!is.na(pitches_sub$release_speed),]
#Set spin_axis == median(spin_axis) for row with missing spin axis value
pitches_sub[is.na(pitches_sub$spin_axis),]$spin_axis = median(pitches_sub$spin_axis, na.rm = T)

summary(pitches_sub)
##   pitch_type        release_speed        pfx_x             pfx_z        
##  Length:24999       Min.   : 39.90   Min.   :-1.9600   Min.   :-2.0200  
##  Class :character   1st Qu.: 84.70   1st Qu.:-1.0900   1st Qu.: 0.1400  
##  Mode  :character   Median : 89.90   Median :-0.5600   Median : 0.6400  
##                     Mean   : 89.08   Mean   :-0.3727   Mean   : 0.5894  
##                     3rd Qu.: 94.00   3rd Qu.: 0.2800   3rd Qu.: 1.1800  
##                     Max.   :102.70   Max.   : 2.0300   Max.   : 2.0600  
##       vx0              vy0               vz0                ax         
##  Min.   :-6.999   Min.   :-149.15   Min.   :-14.947   Min.   :-27.302  
##  1st Qu.: 3.725   1st Qu.:-136.63   1st Qu.: -5.743   1st Qu.:-14.026  
##  Median : 5.649   Median :-130.78   Median : -3.860   Median : -8.088  
##  Mean   : 5.665   Mean   :-129.56   Mean   : -3.768   Mean   : -6.104  
##  3rd Qu.: 7.588   3rd Qu.:-123.21   3rd Qu.: -1.857   3rd Qu.:  2.084  
##  Max.   :19.811   Max.   : -57.48   Max.   : 12.061   Max.   : 19.651  
##        ay               az            spin_axis    
##  Min.   : 5.272   Min.   :-50.770   Min.   :  0.0  
##  1st Qu.:23.956   1st Qu.:-30.310   1st Qu.:136.0  
##  Median :26.859   Median :-24.393   Median :211.0  
##  Mean   :26.970   Mean   :-24.149   Mean   :179.6  
##  3rd Qu.:29.974   3rd Qu.:-16.671   3rd Qu.:227.0  
##  Max.   :41.311   Max.   : -5.012   Max.   :360.0

Let’s take an initial look at some of the variables and how they may relate to pitch type.

ptdata = pitches_sub %>%
  group_by(pitch_type)

#Bar of average pitch speed by pitch type
spdplt = ggplot(ptdata) +
  geom_bar(aes(x = pitch_type, y = release_speed, fill = pitch_type),
           stat = "summary", fun = "mean") +
  ggtitle("Average Pitch Velocity by Pitch Type")
spdplt

#Scatterplot of pitch movement along x and z axes, grouped by pitch type
movplt = ggplot(ptdata, aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
  geom_point() +
  ggtitle("Pitch Movement Along X and Z axes, Grouped By Pitch Type")
movplt

#3D plot of pitch velocity in x, y, z dimensions, grouped by pitch type
v0plt <- plot_ly(ptdata, x = ~vx0, y = ~vz0, z = ~vy0, color = ~pitch_type)
v0plt <- v0plt %>% add_markers()
v0plt <- v0plt %>% layout(scene = list(xaxis = list(title = 'Velocity in x dimension'),
                     yaxis = list(title = 'Velocity in z dimension'),
                     zaxis = list(title = 'Velocity in y dimension')))
v0plt
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
#3D plot of pitch acceleration in x, y, z dimensions, grouped by pitch type
aplt <- plot_ly(ptdata, x = ~ax, y = ~az, z = ~ay, color = ~pitch_type)
aplt <- aplt %>% add_markers()
aplt <- aplt %>% layout(scene = list(xaxis = list(title = 'Acceleration in x dimension'),
                     yaxis = list(title = 'Acceleration in z dimension'),
                     zaxis = list(title = 'Acceleration in y dimension')))

aplt
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
#Bar of average spin axis by pitch type
spinplt = ggplot(ptdata) +
  geom_bar(aes(x = pitch_type, y = spin_axis, fill = pitch_type),
           stat = "summary", fun = "mean") +
  ggtitle("Average Spin Axis by Pitch Type")
spinplt

Now that we have removed all NA values and standardized the data for left- and right-handed pitchers, our data is clean. We can partition the data into training and testing sets. Here, we will use 80% of the data to train the models and the remaining 20% for testing. We can look at the dimensions of the training and testing sets.

set.seed(150)
split = createDataPartition(pitches_sub$pitch_type, times = 1, p = .8, list = F)
train = pitches_sub[split, ]
test = pitches_sub[-split, ]
dim(train)
## [1] 20006    11
dim(test)
## [1] 4993   11

To avoid overfitting our models to the training set, we will use k-fold cross-validation. This method splits the training data into a number of folds (specified by the "number" parameter). Each time, it trains the model on all but one fold and evaluates it on the held-out fold, rotating through the folds. I set it to repeat this whole process 3 times.

kfolds = trainControl(method="repeatedcv", number = 4, repeats = 3, verboseIter = F)

Now, we can train machine learning models to classify pitches. We will initially train 7 classification models:

* rpart (CART) - Recursive Partitioning and Regression Trees
* treebag (Bagged CART) - Bagging
* rf - Random Forest
* C5.0
* lda - Linear Discriminant Analysis
* glmnet - Lasso and Elastic-Net Regularized Generalized Linear Models
* knn - k-Nearest Neighbors

We will train these models and record the accuracy. After the models finish training, we will assess which models are the most accurate and then use those models to predict pitch types on the test set.
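The seven individual train() calls below could also be expressed more compactly as a single loop over the method names. Here is a sketch of that alternative, assuming the train data frame and the kfolds control object defined above:

```r
# Sketch (not run here): fit each caret method with identical settings.
# Assumes `train` (the training data) and `kfolds` exist as defined above.
methods <- c("rpart", "treebag", "rf", "C5.0", "lda", "glmnet", "knn")
models <- lapply(methods, function(m) {
  set.seed(19)  # same seed before each fit, matching the calls below
  caret::train(pitch_type ~ ., data = train, method = m,
               metric = "Accuracy", trControl = kfolds)
})
names(models) <- methods
```

Using caret::train explicitly avoids any confusion between the train() function and the train data frame.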

#Commented for pdf knitting
# set.seed(19)
# rpMod = train(pitch_type~., data = train, method = "rpart", metric = "Accuracy", trControl = kfolds)
# 
# set.seed(19)
# tbMod = train(pitch_type~., data = train, method = "treebag", metric = "Accuracy", trControl = kfolds)
# 
# set.seed(19)
# rfMod = train(pitch_type~., data = train, method = "rf", metric = "Accuracy", trControl = kfolds)
# 
# 
# set.seed(19)
# c50Mod = train(pitch_type~., data = train, method = "C5.0", metric = "Accuracy", trControl = kfolds)
# 
# set.seed(19)
# ldaMod = train(pitch_type~., data = train, method = "lda", metric = "Accuracy", trControl = kfolds)
# 
# set.seed(19)
# gnMod= train(pitch_type~., data = train, method = "glmnet", metric = "Accuracy", trControl = kfolds)
# 
# set.seed(19)
# knnMod = train(pitch_type~., data = train, method = "knn", metric = "Accuracy", trControl = kfolds)
#Commmented for pdf knitting
# saveRDS(rpMod, "rpMod.rds")
# saveRDS(tbMod, "tbMod.rds")
# saveRDS(rfMod, "rfMod.rds")
# saveRDS(c50Mod,"c50Mod.rds")
# saveRDS(ldaMod, "ldaMod.rds")
# saveRDS(gnMod, "gnMod.rds")
# saveRDS(knnMod, "knnMod.rds")
#Commented for rmd file
rpMod = readRDS("rpMod.rds")
tbMod = readRDS("tbMod.rds")
rfMod = readRDS("rfMod.rds")
c50Mod = readRDS("c50Mod.rds")
ldaMod = readRDS("ldaMod.rds")
gnMod = readRDS("gnMod.rds")
knnMod = readRDS("knnMod.rds")

Now that the models have trained, we can print the information for each model. Take note of the accuracy for each model (look at the highest accuracy measurement for models with tuning parameters).

print(rpMod)
## CART 
## 
## 20006 samples
##    10 predictor
##    16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa     
##   0.1325909  0.6543890  0.55798465
##   0.1827023  0.5570297  0.43035614
##   0.2086860  0.3499020  0.05190027
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1325909.
print(tbMod)
## Bagged CART 
## 
## 20006 samples
##    10 predictor
##    16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8345167  0.7950233
print(rfMod)
## Random Forest 
## 
## 20006 samples
##    10 predictor
##    16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8420477  0.8040507
##    6    0.8421141  0.8042848
##   10    0.8396816  0.8013407
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
print(c50Mod)
## C5.0 
## 
## 20006 samples
##    10 predictor
##    16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8112403  0.7659172
##   rules  FALSE   10      0.8265358  0.7856857
##   rules  FALSE   20      0.8310341  0.7911977
##   rules   TRUE    1      0.8119232  0.7668352
##   rules   TRUE   10      0.8248364  0.7836628
##   rules   TRUE   20      0.8296847  0.7895917
##   tree   FALSE    1      0.8049920  0.7586334
##   tree   FALSE   10      0.8280520  0.7869033
##   tree   FALSE   20      0.8318838  0.7916633
##   tree    TRUE    1      0.8054752  0.7592522
##   tree    TRUE   10      0.8263860  0.7848405
##   tree    TRUE   20      0.8308010  0.7902961
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
##  = FALSE.
print(ldaMod)
## Linear Discriminant Analysis 
## 
## 20006 samples
##    10 predictor
##    16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7838488  0.7348521
print(gnMod)
## glmnet 
## 
## 20006 samples
##    10 predictor
##    16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa    
##   0.10   0.0007010387  0.8094647  0.7629901
##   0.10   0.0070103868  0.7879347  0.7341118
##   0.10   0.0701038676  0.7273113  0.6505621
##   0.55   0.0007010387  0.8099314  0.7635931
##   0.55   0.0070103868  0.7860350  0.7315013
##   0.55   0.0701038676  0.7066481  0.6218244
##   1.00   0.0007010387  0.8098315  0.7635993
##   1.00   0.0070103868  0.7863352  0.7320381
##   1.00   0.0701038676  0.6414594  0.5337931
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.55 and lambda = 0.0007010387.
print(knnMod)
## k-Nearest Neighbors 
## 
## 20006 samples
##    10 predictor
##    16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8342833  0.7949531
##   7  0.8352663  0.7960385
##   9  0.8347829  0.7953894
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

After printing the model information, here is the ranking of models by cross-validation accuracy (highest to lowest):

1. Random Forest
2. KNN
3. Bagged CART
4. C5.0
5. glmnet
6. LDA
7. CART
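This ranking can also be produced programmatically with caret’s resamples() helper, which collects the cross-validation results from a named list of fitted models. A sketch, assuming the seven model objects loaded above:

```r
# Collect cross-validation accuracy from each fitted caret model
cv_results <- resamples(list(CART = rpMod, BaggedCART = tbMod, RF = rfMod,
                             C50 = c50Mod, LDA = ldaMod, GLMNET = gnMod,
                             KNN = knnMod))
# The Mean column of this table gives the ranking above
summary(cv_results)$statistics$Accuracy
```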

Now we can evaluate the accuracy of the models’ predictions on the test set. For this, I will use the Random Forest, KNN, and Bagged CART models. In order to evaluate the predictions, we will use confusion matrices and accuracy.

rf_pred = predict(rfMod, test)
pitch_vals_rf = union(rf_pred, test$pitch_type)
rf_mat = confusionMatrix(data = factor(rf_pred, levels = pitch_vals_rf), reference = factor(test$pitch_type, levels = pitch_vals_rf))
rf_mat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   SL   FF   CH   SI   FS   FC   ST   CU   KC   FA   SV   EP   KN
##         SL  683    4    2    0    1   92   78   53    5    0    4    0    1
##         FF    0 1549    2   56    1   17    0    0    0    0    0    0    0
##         CH    1    1  552   23   56    2    0    0    0    0    0    0    1
##         SI    0   49   28  749    8    0    0    0    0    0    0    0    0
##         FS    6    0   24    0   34    2    0    0    0    0    0    0    0
##         FC   53   30    1    0    3  241    1    0    0    0    0    0    0
##         ST   69    0    0    0    0    0  124   15    3    0    3    0    0
##         CU   25    0    0    0    0    0   10  249   36    0    3    0    1
##         KC    1    0    0    0    0    0    1    8   24    0    0    0    1
##         FA    0    0    0    0    0    0    0    0    0    0    0    1    0
##         SV    0    0    0    0    0    0    0    1    0    0    1    0    0
##         EP    0    0    0    0    0    0    0    0    0    0    0    1    0
##         KN    0    0    0    0    0    0    0    0    0    0    0    0    0
##         FO    0    0    0    0    0    0    0    0    0    0    0    0    0
##           Reference
## Prediction   FO
##         SL    1
##         FF    0
##         CH    1
##         SI    0
##         FS    1
##         FC    0
##         ST    0
##         CU    0
##         KC    0
##         FA    0
##         SV    0
##         EP    0
##         KN    0
##         FO    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8426          
##                  95% CI : (0.8322, 0.8526)
##     No Information Rate : 0.3271          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8052          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: SL Class: FF Class: CH Class: SI Class: FS
## Sensitivity             0.8150    0.9486    0.9064    0.9046   0.33010
## Specificity             0.9420    0.9774    0.9806    0.9796   0.99325
## Pos Pred Value          0.7392    0.9532    0.8666    0.8981   0.50746
## Neg Pred Value          0.9619    0.9751    0.9869    0.9810   0.98599
## Prevalence              0.1678    0.3271    0.1220    0.1658   0.02063
## Detection Rate          0.1368    0.3102    0.1106    0.1500   0.00681
## Detection Prevalence    0.1851    0.3255    0.1276    0.1670   0.01342
## Balanced Accuracy       0.8785    0.9630    0.9435    0.9421   0.66167
##                      Class: FC Class: ST Class: CU Class: KC Class: FA
## Sensitivity            0.68079   0.57944   0.76380  0.352941        NA
## Specificity            0.98103   0.98117   0.98393  0.997766 0.9997997
## Pos Pred Value         0.73252   0.57944   0.76852  0.685714        NA
## Neg Pred Value         0.97577   0.98117   0.98351  0.991125        NA
## Prevalence             0.07090   0.04286   0.06529  0.013619 0.0000000
## Detection Rate         0.04827   0.02483   0.04987  0.004807 0.0000000
## Detection Prevalence   0.06589   0.04286   0.06489  0.007010 0.0002003
## Balanced Accuracy      0.83091   0.78030   0.87387  0.675354        NA
##                      Class: SV Class: EP Class: KN Class: FO
## Sensitivity          0.0909091 0.5000000 0.0000000 0.0000000
## Specificity          0.9997993 1.0000000 1.0000000 1.0000000
## Pos Pred Value       0.5000000 1.0000000       NaN       NaN
## Neg Pred Value       0.9979964 0.9997997 0.9991989 0.9993992
## Prevalence           0.0022031 0.0004006 0.0008011 0.0006008
## Detection Rate       0.0002003 0.0002003 0.0000000 0.0000000
## Detection Prevalence 0.0004006 0.0002003 0.0000000 0.0000000
## Balanced Accuracy    0.5453542 0.7500000 0.5000000 0.5000000
knn_pred = predict(knnMod, test)
pitch_vals_knn = union(knn_pred, test$pitch_type)
knn_mat = confusionMatrix(data = factor(knn_pred, levels = pitch_vals_knn), reference = factor(test$pitch_type, levels = pitch_vals_knn))
knn_mat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   SL   FF   CH   SI   FS   FC   ST   CU   KC   FA   SV   EP   KN
##         SL  665    2    6    0    4   98   68   47    3    0    4    0    2
##         FF    1 1555    3   63    2   28    0    0    0    0    0    0    0
##         CH    5    0  528   17   52    6    0    0    0    0    0    0    0
##         SI    0   43   40  747   11    0    0    0    0    0    0    0    0
##         FS    3    1   32    1   33    1    0    0    0    0    0    0    0
##         FC   62   32    0    0    1  221    0    0    0    0    0    0    0
##         ST   69    0    0    0    0    0  131   14    1    0    2    0    0
##         CU   30    0    0    0    0    0   11  253   32    0    2    0    1
##         KC    2    0    0    0    0    0    3   11   32    0    1    0    1
##         FA    0    0    0    0    0    0    0    0    0    0    0    1    0
##         SV    1    0    0    0    0    0    1    1    0    0    2    0    0
##         EP    0    0    0    0    0    0    0    0    0    0    0    1    0
##         KN    0    0    0    0    0    0    0    0    0    0    0    0    0
##         FO    0    0    0    0    0    0    0    0    0    0    0    0    0
##           Reference
## Prediction   FO
##         SL    1
##         FF    0
##         CH    1
##         SI    0
##         FS    1
##         FC    0
##         ST    0
##         CU    0
##         KC    0
##         FA    0
##         SV    0
##         EP    0
##         KN    0
##         FO    0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8348         
##                  95% CI : (0.8242, 0.845)
##     No Information Rate : 0.3271         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7954         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: SL Class: FF Class: CH Class: SI Class: FS
## Sensitivity             0.7936    0.9522    0.8670    0.9022  0.320388
## Specificity             0.9434    0.9711    0.9815    0.9774  0.992025
## Pos Pred Value          0.7389    0.9413    0.8670    0.8882  0.458333
## Neg Pred Value          0.9577    0.9767    0.9815    0.9805  0.985775
## Prevalence              0.1678    0.3271    0.1220    0.1658  0.020629
## Detection Rate          0.1332    0.3114    0.1057    0.1496  0.006609
## Detection Prevalence    0.1803    0.3309    0.1220    0.1684  0.014420
## Balanced Accuracy       0.8685    0.9617    0.9243    0.9398  0.656206
##                      Class: FC Class: ST Class: CU Class: KC Class: FA
## Sensitivity            0.62429   0.61215   0.77607  0.470588        NA
## Specificity            0.97952   0.98200   0.98372  0.996345 0.9997997
## Pos Pred Value         0.69937   0.60369   0.76900  0.640000        NA
## Neg Pred Value         0.97156   0.98262   0.98435  0.992717        NA
## Prevalence             0.07090   0.04286   0.06529  0.013619 0.0000000
## Detection Rate         0.04426   0.02624   0.05067  0.006409 0.0000000
## Detection Prevalence   0.06329   0.04346   0.06589  0.010014 0.0002003
## Balanced Accuracy      0.80191   0.79708   0.87989  0.733467        NA
##                      Class: SV Class: EP Class: KN Class: FO
## Sensitivity          0.1818182 0.5000000 0.0000000 0.0000000
## Specificity          0.9993978 1.0000000 1.0000000 1.0000000
## Pos Pred Value       0.4000000 1.0000000       NaN       NaN
## Neg Pred Value       0.9981957 0.9997997 0.9991989 0.9993992
## Prevalence           0.0022031 0.0004006 0.0008011 0.0006008
## Detection Rate       0.0004006 0.0002003 0.0000000 0.0000000
## Detection Prevalence 0.0010014 0.0002003 0.0000000 0.0000000
## Balanced Accuracy    0.5906080 0.7500000 0.5000000 0.5000000
tb_pred = predict(tbMod, test)
pitch_vals_tb = union(tb_pred, test$pitch_type)
tb_mat = confusionMatrix(data = factor(tb_pred, levels = pitch_vals_tb), reference = factor(test$pitch_type, levels = pitch_vals_tb))
tb_mat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   SL   FF   CH   SI   FS   FC   ST   CU   SV   KC   FA   EP   KN
##         SL  677    1    5    0    4   92   75   57    4    4    0    0    1
##         FF    1 1546    4   61    1   23    0    0    0    0    0    0    0
##         CH    2    2  548   28   60    1    0    0    0    0    0    0    1
##         SI    0   53   28  737    7    0    0    0    0    0    0    0    0
##         FS    3    1   23    1   31    2    0    0    0    0    0    0    0
##         FC   63   30    1    0    0  236    0    0    0    0    0    0    0
##         ST   65    0    0    0    0    0  126   16    5    2    0    0    0
##         CU   26    0    0    0    0    0   11  242    1   36    0    0    1
##         SV    0    0    0    0    0    0    0    1    1    0    0    0    0
##         KC    1    0    0    1    0    0    2   10    0   26    0    0    1
##         FA    0    0    0    0    0    0    0    0    0    0    0    1    0
##         EP    0    0    0    0    0    0    0    0    0    0    0    1    0
##         KN    0    0    0    0    0    0    0    0    0    0    0    0    0
##         FO    0    0    0    0    0    0    0    0    0    0    0    0    0
##           Reference
## Prediction   FO
##         SL    1
##         FF    0
##         CH    1
##         SI    0
##         FS    1
##         FC    0
##         ST    0
##         CU    0
##         SV    0
##         KC    0
##         FA    0
##         EP    0
##         KN    0
##         FO    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8354          
##                  95% CI : (0.8248, 0.8456)
##     No Information Rate : 0.3271          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7962          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: SL Class: FF Class: CH Class: SI Class: FS
## Sensitivity             0.8079    0.9467    0.8998    0.8901  0.300971
## Specificity             0.9413    0.9732    0.9783    0.9789  0.993661
## Pos Pred Value          0.7351    0.9450    0.8523    0.8933  0.500000
## Neg Pred Value          0.9605    0.9741    0.9860    0.9782  0.985398
## Prevalence              0.1678    0.3271    0.1220    0.1658  0.020629
## Detection Rate          0.1356    0.3096    0.1098    0.1476  0.006209
## Detection Prevalence    0.1845    0.3277    0.1288    0.1652  0.012417
## Balanced Accuracy       0.8746    0.9600    0.9391    0.9345  0.647316
##                      Class: FC Class: ST Class: CU Class: SV Class: KC
## Sensitivity            0.66667   0.58879   0.74233 0.0909091  0.382353
## Specificity            0.97974   0.98159   0.98393 0.9997993  0.996954
## Pos Pred Value         0.71515   0.58879   0.76341 0.5000000  0.634146
## Neg Pred Value         0.97469   0.98159   0.98204 0.9979964  0.991519
## Prevalence             0.07090   0.04286   0.06529 0.0022031  0.013619
## Detection Rate         0.04727   0.02524   0.04847 0.0002003  0.005207
## Detection Prevalence   0.06609   0.04286   0.06349 0.0004006  0.008211
## Balanced Accuracy      0.82320   0.78519   0.86313 0.5453542  0.689654
##                      Class: FA Class: EP Class: KN Class: FO
## Sensitivity                 NA 0.5000000 0.0000000 0.0000000
## Specificity          0.9997997 1.0000000 1.0000000 1.0000000
## Pos Pred Value              NA 1.0000000       NaN       NaN
## Neg Pred Value              NA 0.9997997 0.9991989 0.9993992
## Prevalence           0.0000000 0.0004006 0.0008011 0.0006008
## Detection Rate       0.0000000 0.0002003 0.0000000 0.0000000
## Detection Prevalence 0.0002003 0.0002003 0.0000000 0.0000000
## Balanced Accuracy           NA 0.7500000 0.5000000 0.5000000

After making predictions on the test data with our models, we see that the accuracy of the models remains fairly similar to the accuracy numbers we got from k-fold cross-validation on the training set. Looking at the confusion matrices, there are a few trends that we can observe. One is that the models are less accurate when classifying sweepers (ST). The sweeper was popularized across baseball this year, and it most closely resembles a slider, but with more lateral movement. The models also struggled to classify knuckle curveballs (KC), which are a variation of curveballs. The models could potentially get better at classifying these pitch variations with larger samples or with models that are personalized for individual pitchers. Another point to note is that the models had a hard time classifying forkballs (FO), eephuses (EP), slurves (SV), and knuckleballs (KN). These pitches are very rare, so a larger sample size would also be necessary to improve accuracy in classifying them. Despite this, these models performed fairly well overall. The accuracies of the models selected for testing ranged from roughly .83 to .84, and the most accurate model was the Random Forest model. I would say that the models are generally good classifiers of pitch type. I hope to experiment more with using machine learning for classifying pitches in the future. Hopefully, I will be able to improve the accuracies of the models and learn more about classification along the way.
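As a quick check on the comparison above, the overall test-set accuracies can be pulled straight from the three confusionMatrix objects. A sketch, assuming rf_mat, knn_mat, and tb_mat from the previous section:

```r
# Extract the overall test-set accuracy from each confusionMatrix object
sapply(list(RF = rf_mat, KNN = knn_mat, BaggedCART = tb_mat),
       function(m) m$overall["Accuracy"])
```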